-
Notifications
You must be signed in to change notification settings - Fork 451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Storage cleaner] Add wandb path implementation #400
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think resolving the W&B run based on the W&B run name is going to work well in practice, at least for my runs where I typically use the same W&B name for each restart.
Alternatively:
- Use the existing name/group filters to get a set of potential matches
- Then filter those matches by comparing all fields in the config
This should work since we save the train config to the W&B run config.
Co-authored-by: Pete <[email protected]>
This would help filter out some runs but I assume that you sometimes use the exact same config for several runs (I have in the past)? Are you concerned about that case too? |
I probably have some runs like this due to failures, but in general I try to remove failed runs immediately from W&B. When that happens with this script we could print out links to each of those runs so the user could check to see if some of duplicates can be removed from W&B. |
223eb6a Filtering by config took ~90 lines. |
Hmm you might be able to simplify that a bit by trying to load the W&B config into a |
b4ac61e Somehow I didn't even think about this. It looks like it works fine. |
This PR adds a basic implementation for getting a wandb path for a given run directory. This may have issues in some common cases:
save_overwrite
and so the top-levelconfig.yaml
does not match all checkpoints, then the resulting wandb path won't be correct for all checkpoints of the "run". Dealing with this programmatically seems more messy than is worth.config.yaml
, then the script exits with error. Using a checkpoint'sconfig.yaml
could lead to the previous scenario.I am hoping that most of the runs worth keeping do not face the above issues. I can improve how the storage cleaner deals with these scenarios if we believe they are more important than I had expected.